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Abstract 



We study feature selection for fc-means clustering. Although the literature contains many 
methods with good empirical performance, algorithms with provable theoretical behavior have 
. only recently been developed. Unfortunately, these algorithms are randomized and fail with, 

I 1 , say, a constant probability. We address this issue by presenting the first deterministic feature 

C/3 I selection algorithm for /c-means with theoretical guarantees. At the heart of our algorithm lies a 

I ^1 1 deterministic method for decompositions of the identity. 

(N ' 

1^ ■ 1 Introduction 

. This paper is about feature selection for /c-means clustering, a topic that received considerable atten- 
. tion from scientists and engineers. Arguably, A:-means is the most widely used clustering algorithm 
■ in practice [35]. It's simplicity and effectiveness are remarkable among all the available methods [30]. 
On the negative side, using /c-means to cluster high dimensional data with, for example, billions of 
features is not simple and straightforward [16]; the curse of dimensionality makes the algorithm very 
^ ' slow. On top of that, noisy features often lead to overfitting, another undesirable effect. Therefore, 
I reducing the dimensionality of the data by selecting a subset of the features, i.e. feature selection, 
and optimizing the fe-means objective on the low dimensional representation of the high dimensional 
data is an attractive approach that not only will make fc-means faster, but also more robust [15, 16]. 

The natural concern with throwing away potentially useful dimensions is that it could lead to 
high clustering error. So, one has to select the features carefully to ensure that one can recover 
comparably good clusters just using the dimension-reduced data. Practitioners have developed 
numerous feature selection methods that work well empirically [15, 16]. The main focus of this work 
is on algorithms for feature selection with provable guarantees. Recently, Boutsidis et al. described 
a feature selection algorithm that gives a theoretical guarantee on the quality of the clusters that are 
produced after reducing the dimension [5]. Their algorithm, which employs a technique of Rudelson 
and Virshynin [32] , selects the features randomly with probabilities that are computed via the right 
singular vectors of the matrix containing the data ([7] describes a randomized algorithm with the 
same bound but faster running time). Although Boutsidis et al. give a strong theoretical bound 
for the quality of the resulting clusters (we will discuss this bound in detail later), the bound fails 
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with some non negligible probability, due to the randomness in how they sample the features. This 
means that every time the feature selection is performed, the algorithm could (a) fail, and (b) return 
a different answer each time. Since practitioners are not comfortable with such randomization 
and failures, it is rarely the case that a randomized algorithm will make impact in practice. To 
better address the applicability of such feature selection algorithms for fc-means, there is a need 
for deterministic, provably accurate feature selection algorithms. We present the first deterministic 
algorithms of this type. Our contributions can roughly be summarized as follows. 

• Deterministic Supervised Feature Selection (Theorem 2). Given any dataset and any 
/c-partition in this dataset, there is a small set of 0{k) feature dimensions that can produce a 
fe-partition that is no more than a constant factor worse in quality than the given /c-partition. 
Moreover, this small set of feature dimensions can be computed in deterministic low-order 
polynomial time. This is the first deterministic algorithm of this type. Prior work [5, 7] offers 
a randomized algorithm that requires il.{klogk) features for a comparable clustering quality. 

• Existence of a small set of near optimal features (Corollary 3). We prove existence of 
a small set of near optimal features. That is, given any dataset and the number of clusters k, 
there is a set of 0{k) feature dimensions that can produce a /c-partition that is no more than a 
constant factor worse in quality than the optimal /c-partition of the dataset. The existence of 
such a small set of features was not known before. Prior work [5, 7] only implies the existence 
of r2(/clog/c) features with comparable performance. 

• Deterministic Unsupervised Feature Selection (Theorem 4). Given any dataset and 
the number of clusters k, it is possible, in deterministic low-order polynomial time, to select r 
feature dimensions, for any r > k, such that the optimal A;-partition of the dimension-reduced 
data is no more than 0{n/r) worse in quality than the optimal /c-partition; here n is the 
number of features of the dataset. This is the first deterministic algorithm of this type. Prior 
work [5, 7] offers a randomized algorithm with error (3 + 0(A; log /c/r)), for r = n{k\ogk). 

• Unsupervised Feature Selection with a small subset of features (Theorem 5). Fi- 
nally, given any dataset and the number of clusters /c, in randomized low-order polynomial 
time it is possible to select a small number, r, of feature dimensions, with k < r = o(/c log /c), 
such that the optimal /c-partition for the dimension-reduced data is no more than 0(A;log(A;)/r) 
worse in quality than the optimal A;-partition. In particular, this is the first (albeit randomized) 
algorithm of this type that can select a small subset of 0{k) feature dimensions and provide 
an O (log /c) -fact or guarantee. Prior work [5, 7] is limited to selecting r = i}{k\ogk) features. 
The new algorithm combines ideas from this paper with the technique of [5]. 

In order to get deterministic algorithms and be able to select 0{k) feature dimensions, we use 
techniques that are completely different from those used in [5, 7]. Our approach is inspired by a recent 
deterministic result for decompositions of the identity which was introduced in [3] and subsequently 
extended in [4] . This general approach might also be of interest to the Machine Learning and Pattern 
Recognition communities, with potential applications to other problems involving subsampling, for 
example [33, 9]. 
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1.1 Background 

We provide the basic background on /c- means clustering such that to describe our results in Section 2. 
Section 3 provides all the necessary background to describe our algorithms (Section 4) and to prove 
our main results (Section 5). Let us start with the definition of the A;-means clustering problem.^ 
Consider m points in an n-dimensional Eucledian space, 

7^ = {pi,P2,...,p„} eK™, 

and integer k denoting the number of clusters. The objective of /c-means is to find a fc-partition of V 
such that points that are "close" to each other belong to the same cluster and points that are "far" 
from each other belong to different clusters. A /c-partition of P is a collection 

S = {Si,S2, ■■■,Sk} 

of k non-empty pairwise disjoint sets which covers V. Let sj = \Sj\, be the size of Sj. For each set 
Sj, let /Xj € M" be its centroid (the mean point), 

The A:-means objective function is 

m 

J^{V,S)=Y,\\P^-^^{P^)\\l 

i=l 

where /^(pi) is the centroid of the cluster to which pj belongs. The goal of /c-means is to find a 
partition S which minimizes for a given V and k. We will refer to any such optimal clustering as, 

Sopt = argmin J"(P,5). 
S 

The corresponding objective value is 

) Sopt ) ■ 

The goal of feature selection is to construct points 

p = {Pl,P2,...,p„}G]R^ 

^In Section 3, we provide an alternative definition using matrix notation, which will be useful in proving the main 
results of this work. 
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(for some r <^ n specified in advance) by projecting each pj onto r of the coordinate dimensions. 
Consider the optimum A;-means partition of the points in V, 

Sopt = argmin 
s 

The goal of feature selection is to construct a new set of points V such that, 

) Sopt ) ■ 

Here, a is the approximation factor and might depend on m, n, k and r. In words, computing a 
partition Sopt by using the low-dimensional data and plugging it back to cluster the high dimensional 
data, gives an approximation to the optimal value of the fc-means objective function. Notice that 
we measure the quality of Sopt by evaluating the fc-means objective function in the original space, 
an approach which is standard [29, 23, 2]. Comparing Sopt directly to Sopt^ i-e. the identity of the 
clusters, not just the clustering error, would be much more interesting but at the same time a much 
harder (combinatorial) problem. A feature selection algorithm is called unsupervised if it computes 
V by only looking at V and k. Supervised algorithms construct V with respect to a given partition 
Sin of the data. Finally, an algorithm will be a 7-approximation for /c-means (7 > 1) if it finds a 
clustering S^ with corresponding value .F^ < ^Topt- Such algorithms would be useful to state our 
results in a more general way. 

Definition 1. [k-means approximation algorithm] An algorithm is a "'y- approximation" for 
k-means clustering (7 > 1) if it takes inputs the dataset V = {pi, P2> •••) Pm} G 1^" o-nd the number 
of clusters k, and returns a clustering S^ such that, 

= T{V,S^) < ^Topt- 

The simplest algorithm with 7=1, but exponential running time, would try all possible k- 
partitions and return the best. Another example of such an algorithm is in [23] with 7 = 1 + e 
(0 < e < 1). The corresponding running time is 0{mn • 2^'^/'^)'^'^'). Also, the work in [2] describes a 
method with 7 = 0(log k) and running time 0{mnk). The later two algorithms are randomized. For 
other 7-approximation algorithms, see [29] as well as the discussion and the references in [29, 23, 2]. 

2 Statement of our main results 
2.1 Supervised Feature Selection 

Our first result is within the context of supervised feature selection. Suppose that we are given 
points V and some /c-partition Sin- The goal is to find the important features of the points in V 
with respect to the given partition In other words, for any given partition Sin, we would like 
to find the features of the points such that by only using these features and some 7-approximation 
algorithm, we will be able to obtciiii ct partition Sout that has similar performance to Sin- 
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Theorem 2. Fix V = {pi, P25 •••^ Pm} G IK"? Sin, k, and the number of features to he selected 
k < r < n. There is an 0{mninm{m,n} +rk'^n) time deterministic feature selection algorithm that 
constructs V = {pi,P25 •••^Pm} G such that Sout, i-S- the partition that is constructed by running 
some J -approximation method on V, satisfies 

Essentially the clustering Sout is at most a constant factor worse than the original clustering 
Sin', this means we can accomplish compression of the feature dimensions and preserve a specific 
clustering in the data. This is useful, for example, in privacy preserving applications where one seeks 
to release minimal information of the data without destroying much of the encoded information [?]. 
Notice that the feature selection part of the theorem (i.e. the construction of V) is deterministic. 
The 7-approximation algorithm, which can be randomized, is only used to describe the clustering 
that can be obtained with the features returned by our deterministic feature selection algorithm 
(same comment applies to Theorem 4). A complete survey of available approximation algorithms for 
/c-means clustering is beyond the scope of the present work; our focus is on the feature selection. The 
corresponding algorithm is presented as Algorithm 1 in Section 4 and the proof of the theorem is 
given in Section 5. Prior to this result, the best, and in fact the only method with theoretical 
guarantees for this supervised setting [5, 7] is randomized^ and gives 



HV,Sout) < 7 • (3 + 0(V^log(fc)/r)) • H'P,S^n). 

Further, [5, 7] requires r = Q(klogk), otherwise the analysis breaks. We improve this bound by 
0(log/c); also, we allow the user to select any r > k features. 

A surprising existential result is a direct corollary of the above theorem by setting Sin = Sopt and 
assuming a 7 approximation algorithm with 7 = 1 (an algorithm that tries all possible partitions 
can achieve 7 = 1). 

Corollary 3. For any set of points V = {pi,P2, •••,Pm.} G I^""; integer k, and e > 0, there is a set 
of r = 0{k/e^) features V = {pi,P2) •••)Pm} G I^*" such that, if 

Sopt = CLrgminT(V,S) CLnd Sopt = CLrgminJ^(V,S), 
S S 



then, 



HV,Sopt) < {5 + e)T(V 5 Sopt ) • 



In words, for any dataset there exist a small subset of 0{k/e^) features such that the optimal 
clustering on these features is at most a (5 + e)-factor worse that the optimal clustering on the 
original features. Unfortunately, finding these features in an unsupervised manner (without the 



^We should note that [5] describes the result for the unsupervised setting but it's easy to verify that the same 
algorithm and bound apply to the supervised setting as well. 
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knowledge of the optimal partition Sopt) is not obvious; but even the existence of such a small subset 
of good features was not known before; for more discussion see [17, 18, 14, 12, 1]. 



2.2 Unsupervised Feature Selection 

Our second result is within the context of unsupervised feature selection. In unsupervised feature 
selection, the goal is to obtain a /c-partition that is as close as possible to the optimal /c-partition of the 
high dimensional data. The theorem below shows that it is possible to reduce the dimension of any 
high-dimensional dataset by using a deterministic method and obtain some theoretical guarantees 
on the quality of the clusters. 

Theorem 4. Fix V = {pi, P2) Pm} G IR"? k, and the number of features to be selected k < 
r < n. There is an 0(mnmin{?n, ?i} + rk'^n) time deterministic algorithm to construct V = 
{Pi , P2,...,Pr?i} G I^** such that Soutj i-s. the partition that is constructed by running some 7- 
approximation k-means method on V, satisfies 



7 • O {n/r) ■ Fc 



opt' 

The corresponding algorithm is presented as Algorithm 2 in Section 4 and the proof of the 
theorem is given in Section 5. Prior to this result, the best method for this task is in [5, 7], which 
replaces O {n/r) with ^3 + ^-y/A; log(/c)/r^^ and it is randomized, i.e. this bound is achieved only 
with a constant probability. Further, [5, 7] requires r = VL{k\ogk), so one cannot select o(A;logA;) 
features. Clearly, our bound is worse but we achieve it deterministically, and it applies to any r > k. 

Finally, it is possible to combine the algorithm of Theorem 4 with the randomized algorithm 
of [5, 7] and obtain the following result. 

Theorem 5. Fix V = {pi, P25 Pm} G IR"? k, and the number of features to be selected k < 
r < Aklogk. There is an O (mn/c + r/c^ log(A;) + r log r) time randomized algorithm to construct 
V = {pi, p2, Pm.} G such that Sout, i-e. the partition that is constructed by running some 
-y- approximation k-means method on V satisfies, with probability 0.4, 

2^ 



FiV,S..t) < I 1 + 6407 \T.pt 



1 - y/k/r 

(k\ogk\ 
= 7-0 [—^ j • J'opt. 

The corresponding algorithm is presented as Algorithm 3 in Section 4 and the proof of the theorem 
is given in Section 5. Comparing Theorem 5 with Theorem 4, we obtain a much better approximation 
bound at the cost of introducing randomization. Comparing Theorem 5 with the main result of [5, 7], 
we have been able to break the barrier of having to select r = Q{k log A;) features. So, Theorem 5, to 
the best of our knowledge, is the best available algorithm in the literature for unsupervised feature 
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selection in A;-means clustering using O (k) features. If r = 0,{k log k), our algorithm cannot improve 
on [5, 7]. We should note here that the main algorithm of [5] can be obtained as Algorithm 3 in 
Section 4 after ignoring the 4th and 5th steps and replacing the approximate SVD with the exact 
SVD in the first step (the algorithm in [7] uses approximate SVD). 

2.3 Discussion 

All our algorithms require the computation of the top k (approximate) singular vectors of the matrix 
containing the data. Then, the selection of the features is done by looking at the structure of these 
singular vectors and using the deterministic techniques of [3, 4]. We should note that [5, 7] takes a 
similar approach as far as computing the (approximate) singular vectors of the dataset; then [5, 7] 
employ the randomized technique of [32] to extract the features. 

In the introduction, we stated our results for unsupervised feature selection assuming the output 
partition was the optimal one in the reduced-dimension space, Sout = <Sopt- This is the case if we 
assume a 7-approximation algorithm with 7 = 1. Our theorems are actually more general and apply 
to any 7 > 1. 

3 Preliminaries 

We now provide the necessary background such that to be able presenting our algorithms and proving 
our main theorems. 

3.1 Singular Value Decomposition 

We start with the description of the Singular Value Decomposition of a matrix, which plays an 
important role in the description of our algorithms. 

The Singular Value Decomposition (SVD) of A G R™^"- with rank p = rank(A) < min{m, n} is 

UaGK™-xp ' ^ V ' 

with singular values ai > a2 > ■ ■ ■ > ap > contained in G R^^'^ and T^p-k G r(-P-'')^(p-^) . 
Ufc G R"^^'^ and Up_fc G R'">^(/'-'^) contain the left singular vectors of A. Similarly, Vfc G R"-"^^ 
and Vp-k S R"^('^~*^) contain the right singular vectors. We use A+ = VaS^^UJ G R"^'" to 
denote the Moore- Penrose pseudo- inverse of A with denoting the inverse of Ea- We use the 
Frobenius and the spectral matrix norms: ||A||p = Ylij-^ij — J2i=i'^i'-' \\M\2 — ^i- Given A 
and B of appropriate dimensions: ||AB||f < || A||f||B||2. This is a stronger version of the standard 
submultiplicavity property ||AB||f < ||A||f||B||f, which we will refer to as spectral submultiplicavity. 
Let Afc = UfcSfcVfc = AVfcV^ and Ap_fc = A - Afc = Up_fcSp_fcVj_fc. The SVD gives the best 
rank-k approximation to A in both the spectral and Frobenius norms: if rank(A) < k then (for 
e = 2, F) ||A - Afcllg < ||A - A||^; also, ||A - A^Hf = HS^.^Hf- 
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3.2 Approximate Singular Value Decomposition 

The exact SVD of A, though a deterministic algorithm, takes cubic time. More specifically, for any 
A; > 1, the running time to compute the top k left and/or right singular vectors of A G i^^x" 
0(771,71 min{m, n}). We will use the exact SVD in our deterministic feature selection algorithms in 
Theorems 2 and 4. To speed up our randomized algorithm in Theorem 5, we will use a factorization, 
which can be computed fast and approximates the SVD in some well defined sense. We quote a 
recent result from [4] for a relative-error Frobenius norm SVD approximation algorithm. The exact 
description of the algorithm of the following lemma is out of the scope of the present work. 

Lemma 6 (Lemma 13 in [4]). Given A G i^™-x'" qJ rank p, a target rank 2 < k < p, and < e < 1, 
there exists an O [mnk/e) time randomized algorithm that computes a matrix Z £ M"^'^ such that 
Z^Z = Ik, EZ = O^xfc (for E = A - AZZ'^ G W^^^), and 

E[||E||^] < (l + e)||A-Afc||2. 

We use Z = FastApproximateSVD{A,k,e) to denote this randomized procedure. 



3.3 Linear Algebraic Definition of fc- means 

We now give the linear algebraic definition of the A;- means problem. Recall that V = {pi, P2 5 Pm} S 
M" contains the data points, k is the number of clusters, and S denotes a /c-partition of V. Define 
the data matrix A G M™^"- as 

= [Pi,P2, • • • ,Pm\- 

We represent a clustering S by its cluster indicator matrix X G M."^^^. Each column j = 1, . . . ,k of 
X represents a cluster. Each row i = 1, . . . ,m indicates the cluster membership of the point pj. So, 

Xjj = 

if and only data point pi is in cluster Sj {sj = \Sj\). Every row of X has one non-zero element, 
corresponding to the cluster the data point belongs to. There are sj non-zero elements in column j, 
which indicates the points belonging to Sj. The two formulations are related: 

J"(A,X) = IIA-XX'^AIll 

m 

1=1 

m 

1=1 

= Hv.s), 

where we have used the identity p^X"'"A = fi{pi)"^ , for i = 1, ...,m, which can be verified after some 
elementary algebra. Using this formulation, the goal of fc-means is to find an indicator matrix X 
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which minimizes ||A — XX'^A||p. We will denote the best such matrix with, 

Xopt = arg min ||A — XX'^A|||; 

SO, 

•F jpt = ||A — XopfXgpj A||p. 
Since A^ is the best rank k approximation to A, 

II A — A^llp < J- jpt, 

because Xop^X^^A has rank at most k. 

Using the matrix formulation for /c-means, we can restate the goal of a feature selection algorithm 
in matrix notation. So, the goal of feature selection is to construct points C G M"^^*^', where C is 
a subset of r columns from A, which represent the m points in the r-dimensional selected feature 
space. Note also that we will allow rescaling of the corresponding columns. Now consider the 
optimum A;-means partition of the points in C, 

±opt = arg min ||C - XX'^C|||. 

The goal of feature selection is to construct the new set of points C such that, 

||A — SLopt^optMlp ^ a||A — Xop(X^jA|||. 

3.4 Spectral Sparsification and Rudelson's concentration Lemma 

We now present the main tools we use to select the features in the context of fc-means clustering. Let 
0, G R"^'' be a matrix such that Ail. G contains r columns of A. ^2 is a projection operator onto 

the r-dimensional subsect of features. Let S € W^"^ be a diagonal matrix, so AQS G jR^x*" rescales 
the columns of A that are in AQ. Lituitively, AOS projects down to the chosen r dimensions and 
then rescales the data along these dimensions. The following two lemmas describe two deterministic 
algorithms for constructing such Q and S. 

Lemma 7 (Lemma 11 in [4]). Let G M*^'<" and B G M^i^" with V^V = h- Let r > k. There is 
a deterministic 0{rk^n + iin) time algorithm to construct Q G M'"^*" and S G M^'^*" such that, 

(Tfc(V^J7S) > 1 - ||BQS||f < ||B||f. 

Lemma 8 (Lemma 10 in [4]). Let G M''''", Q G M^^xn^ yXy ^ qTq ^ j^^^ ^ ^ 

There is a deterministic 0{rk'^n + ri\n) time algorithm to construct Q. G M"^^', S G W'^^' such that, 

Q!^S||2 < 1 + 
Moreover, if Q = In, the running time is 0{rk'^n). 



c7fc(V^0S) > 1 - 
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Input: A G M™^", Xj„ G M™^*^, number of clusters k, and number of features r > k. 
Output: C E ]^™x'' containing r rescaled columns of A. 



1: Compute the matrix Vfc G M"-^'^ from the SVD of A. 

' B = ( A -"xJiE ) ^ 

3: Let [fl,S] = DeterministicSamplingl(V^,B,r). 
4: return C = AOS G M™^^ 



Algorithm 1: Supervised Feature Selection (Theorem 2) 




Lemmas 7 and 8 are generalizations of the original work of Batson et al [3], which presented a 
deterministic algorithm which operates only on V. Lemmas 7 and 8 are proved in [4] (see Lemmas 
10 and 11 in [4]). We will use Lemma 7 in a novel way: we will apply it to a matrix B of the form 



B 



so we will be able to control the sum of the Frobenius norms of two different matrices B^, B2, which 
is all we need in our application. The above two lemmas will be used to prove our deterministic 
results for feature selection, i.e. Theorems 2 and 4. 

We will also need the following result, which corresponds to the celebrated work of Rudelson and 
Virshynin [31, 32] and describes a randomized algorithm for constructing matrices Q and S. The 
lower bound with the optimal constants 4 and 20 was recently proved as Lemma 15 in [26]. The 
Frobenius norm bounds are straightforward; a short proof can be found as Eqn. 36 in [10]. This 
lemma will be used to prove our hybrid randomized result for feature selection, i.e. Theorem 5. 

Lemma 9. Let G M''''", B G M^^''" and Q G M^^^^", with V'^V = h- Let r > Ak\u{k). 
Algorithm 6 in 0{nk + r log(r)) time constructs G M'"^*" and S G M^'^*" such that w.p. 0.9, 



aliV^nS) > 1 - x/4fcln(20A:)/r; E 

E[||Ql^S||y = IIQII^. 



|BQS||^ 



4 Algorithms 

This section gives the details of the algorithms relating to Theorems 2, 4, and 5 (recall that there 
is no algorithm for Corollary 3 since our result is only existential). The resulting algorithms are 
presented as Algorithms 1, 2, and 3, respectively. The intuition for Algorithms 1, 2, and 3 stems 
from the following crucial Lemma, which lies at the heart of the analysis of our algorithms and 
the proofs of Theorems 2, 4, and 5. In the following lemma, the sampling and rescaling matrices 
Q G M"^'' and S G W^^ are arbitrary, modulo the rank restriction in the lemma. 
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Input: A G M™^", number of clusters k, and number of features r > k. 
Output: C E ^mxr containing r rescaled columns of A. 

1: Compute the matrix Y^eW^^^ from the SVD of A. 
2: Let [0,S] = DeterministicSamplingll(V^, I„, r). 
3: return C = AOS G M™^^ 

Algorithm 2: Unsupervised Feature Selection (Theorem 4) 



Input: A G M"'^'^, number of clusters fc, and number of features fc < r < 4/cln/c. 
Output: C G l^^x*" containing r rescaled columns of A. 

1: Compute the matrix Z G W^^^ from the approximate SVD of A in Lemma 6: 

Z = FastApproxi'mateSVD{A,k, i) . 
2: Let c = 16fcln(20fc). 

3: Let [r2i,Si] = RandomizedSampling(Z"'", c). 

4: Compute V G with the top k right singular vectors of Z^'^f^iSi G M'^^'^. 

~ T 

5: Let [17, S] = DeterministicSamplingll(V ,1c, r). 
6: return C = AJ7iSiJ7S G M™^^ 

Algorithm 3: Randomized Unsupervised Feature Selection (Theorem 5) 

Lemma 10. Fix A G M"*^", A; > 0, and Xi„ G M'^^'=. Let n G M'^^'' and S G M'''"" be any matrices, 
so C = AQS G M"*^^. Let 'Kout be the output of some 'y- approximation algorithm on C, k. Then, if 
rank(VjOS) = k, 

HA Y xT Al|2 < HA X . ||2 , ||(A - X.„x;^„A)OS||| + ||(A - AVfeV^)OS||| 

The message of this lemma is very usefull. It tells us that a provably accurate feature selection 
algorithm need only control three error terms 

||(A-Xi„xTA)f7S|||; ||(A-AVfcVT)OS|||; aUvlnS). 

Another requirement of the matrices Q and S is that 

rank(V^fiS) = k, 

which can be achieved if 

afc(Vjf]S) > 0. 

Notice that DeterministicSamplingI in Algorithm 1 in Theorem 2 controls the smallest singular 
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value of an orthonormal matrix and the Frobenius norm of some other matrix B. We carefully chose 



B=( ^-^^'^^^ 
y A — Xj„Xj„A 

so that when we control ||B||p, we can control the sum of the squared Frobenius norms of the two 
submatrices, which is all we need (see the proof of Theorem 2). 

The term ||(A — Xj.„X^A)OS||p in the equation of the above lemma indicates why Algorithm 
1 requires the knowledge of some partition Xj„. In order to remove the dependence on Xj„ and 
obtain our unsupervised feature selection result, we use spectral submultiplicativity to manipulate 
this term as follows, 

\\{A-X,nXlA)nS\\l < ||(A-X,„xTA)||2||I„f]S||2. 

Similarly, we obtain, 

||(A - AVkyJ)nS\\l < ||(A - AV,.V^)|||||I„J7S||i. 

These two results along with Lemma 8 give the intuition for the method DeterministicSamplingll 
in Algorithm 2 that accomplishes the claim of Theorem 4 (Algorithm 3 uses this same trick as well) . 



4.1 Running Times 

We now comment on the running times of Algorithms 1,2, and 3. Algorithm 1 computes the matrix 
Vfc in 0{mnm.m{m,n}) time; then, A — AV^Vj and A — Xj.„X^A can be computed in 0{mnk). 
DeterministicSamplingl takes time 0{rk'^n + mn), from Lemma 7. Overall, the running time of 
Algorithm 1 in Theorem 2 is 0(mnmin{m, n} + r/c^n). Similar arguments suffice to show that 
this is also the case for Algorithm 2 in Theorem 4. Finally, the running time of Algorithm 3 is 
O {rank + rk^ log(/c) + r logr), since it employs the approximate SVD of Lemma 6, the randomized 
technique of Lemma 9, and the deterministic technique of Lemma 7. 

4.2 Description of the Algorithms Accomphshing Lemmas 7, 8, and 9 

DeterministicSamplingl. At a high level. Algorithm 4 selects columns in a greedy way such that 
the desired bounds hold at every iteration of the algorithm (r iterations in total). The key equation 
is in step 5 of the algorithm. If an index ij- satisfies this equation, then the corresponding columns 
satisfy the desired bounds as well. Thus, a key requirement in the algorithm is that such an index 
ir exists in every step. These two observations give the high level idea in DeterministicSamplingl. 

To describe the algorithm in more detail, it is convenient to view the input matrices as two sets 
of n vectors, V""" = [vi, V2, . . . , v„] and B = [bi, b2, . . . , b„]. Given k and r > k, introduce the 
iterator r = 0, 1, 2, r — 1, and define the parameter = t — \frk. For a square symmetric matrix 
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Input: V'^ = [vi,V2,... ,v„] G M'^^", B = [bi,b2,. . . ,b„] G R^^"""-, and r > k. 
Output: Sampling matrix f2 G M"^'' and rescaling matrix S G R^^*". 

1: Initialize Aq = Ojtxfci ^ = Onxr^ and S = Orxr- 
Set constants 5b = ||B|||(1 — y^k/r)~^; 6l = 1- 
for T = to r — 1 do 

Let Lr = T — \frk\ Ub = "rb-^. 

Pick index v G {1, 2, n} and number > (see text for the definition of [/, V): 



2: 
3: 
4: 
5; 



6: Update A,- = A^_i + irVj^v^^; set Jl^.r+i = 1 and St-+i,t+i = 
7: end for 

8: Multiply all the weights in S by 



Y^r ^(1 — y/kjr). 



9: Return: Q and S. 



Algorithm 4: DeterministicSamplingI (Lemma 7) 



A G M with eigenvalues Ai, . . . , Afc, v G M , L G M, define 



i=l 



and let L(v, J^,, A, l) be defined as 
L(v,(5i,A,L) 

where 



vT(A-L'I,)-2v 



(l',A)-0(l,A) 



V^(A-L'lfc)"V, 



l' = L + (5l = L + 1. 

For a vector z and scalar 5 > 0, define the function 



C/(z,(5) = (5~^z^z 



At each iteration r, the algorithm selects i,-, tr > for which 

f/(bi,,5p) < t-^ < L(vi^,(5L, A,w). 

Such irj exist, as was shown in [4]. The running time of the algorithm is dominated by the search 
for an index v satisfying 
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Input: VT = [vi,V2,...,v„] G M'^x", Q = [qi, qs, • • . , q^] G M^^^", and r > k. 
Output: Sampling matrix f2 G M"'^'' and rescaling matrix S G R^^*". 

1: Initialize Aq = Okxk, Bq = O^^xfe' ^ = ^nxr, and S = Orxr- 
2: Set constants 5q = (1 + ^2/?^) (^1 - \/kJr^ ; 5l = 1. 
3; for T = to r — 1 do 
4: Let Lt- = r — \/Vk; = 5q (r + \^i2r) 

5: Pick index v G {1, 2, n} and number tr > (see text for the definition of U, L): 

6: Update A^- = At-_i + tr^i^vj^; = Br^i + irqi^qi"^, and 

set ^li^^r+i = 1, St-+i,t+i = l/v^- 
7: end for 



8: Multiply all the weights in S by 




9: Return: 0, and S. 



Algorithm 5: DeterministicSamplingll (Lemma 8) 



(one can achieve that by exhaustive search). One needs (j){L,A), and hence the eigenvalues of A. 
This takes 0{k^) time, once per iteration, for a total of 0{rk^). Then, for z = 1, . . . ,n, we need to 
compute L for every Vj. This takes 0{nk'^) per iteration, for a total of 0{rnk'^). To compute U , we 
need hjhi for i = 1, . . . , n, which need to be computed only once for the whole algorithm and takes 
0{i\n). So, the total running time is 0{nrk'^ 

DeterministicSamplingll. Algorithm 5 is similar to Algorithm 4; we only need to define the 
function U. For a square symmetric matrix B G M^^^^^ with eigenvalues Ai, . . . , Xe^, q G M^^, u G M, 
define: 



1=1 



and let [/(q, 5q, B, u) be defined as 



^7(q,5Q,B,u) = ^^^\-''''';^;"^ -c,^{B-v%,r\ 

where 

u' = u + Jq = u + (1 + (1 - T^a) . 

The running time of the algorithm is 0{nrk'^ + nrl"^). 

RandomizedSampling. Finally, Algorithm 6 takes input only V; it selects the columns ran- 
domly based on a probability distribution that is computed via V. It needs 0{nk) to compute the 
probabilities and 0{n + r logr) to select the indices, a total of 0{nk + r logr) time. 
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Input: = [vi, V2, . . . , v„] G M'^^", and the number of sampled columns r > 4A;lnA;. 


Output: Sampling matrix i7 G W^^"^ and rescaling matrix S G M''^''. 


1 


111 1 1 2 

For i = 1, n compute pi = Vj 2- 


2 


Initialize = 0„xr and S = O^xr- 


3 


for r = 1 to r do 


4 


Select index i G {l,2,...,n} independently with the probabihty of selecting index i 




equal to pi. 


5 


Set Oj^T- = 1 and Sr^r = I/a/k^- 


6 


end for 


7 


Return: and S. 



Algorithm 6: RandomizedSampling (Lemma 9) 



5 Proofs 

We first prove Lemma 10, which is the main technical contribution of this work. Then, combining 
Lemma 10 along with Lemmas 7, 8, and 9, we prove Theorems 2, 4, and 5, respectively. 

5.1 Proof of Lemma 10 

We start by manipulating the term ||A — XomjXJ^^A||^. First, we decompose the matrix A as 

A = Afc + (A-Afc). 
Now, from the Pythagorean Theorem for matrices, ^ we have that: 

||A - XoutXjut A Hp = ||(Im - Xotj4Xo„f)Afc + (Im - XomjXJ„J(A - Afc)||| 

= ||(Im - Xo„iXo„j)Afc||| + ||(Im - XoutXo„j)(A - Afc)||p. 

Using that 1^ — y^out^ut is a projection matrix and that 

||A-Afc||^<||(U-X,piXT,)A|||, 

we obtain 

||A - Xo„tXj„jA||p < \\{\m - Xo„fX^tjj)Afc||| + \\{\m - XoptX^pt)A|||. 

We now bound the first term. Given and S, for some residual matrix Y G M™^", let 

Afc = A17S(Vjj7S)+Vj + Y. 

^Let Yi,Y2 e R'"X" satisfy YiYi' = 0,„xm. Then, ||Yi + YaHl = ||Yi||| + \\Y2\\l. 
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Then, 



||(U-X„„iXL)Afc||2 < 2||(U-X,„iXDAOS(VTj7S)+VT||2 +2||Y||2 

< 2||(U - X,„,XL)A0S||2 ||(VT0S)+||2 + 2||Y||2 

< 27||(U-X„xT)AJ]S||2||(vTf]S)+||2 + 2||Y||2 

In (a), we used ||Yi + Y2III < 2||Yi||p + 2||Y2||p (for any two matrices Yi, Y2), which follows from 
the triangle inequality of matrix norms; further we have removed the projection matrix !„ — Xo^tX^j 
from the second term, which can be done without increasing the Frobenius norm. In (b), we used 
spectral submultiplicativity and the fact that is orthonormal, and so it can be dropped without 
increasing the spectral norm. Finally, in (c), we replaced ^out by Xj„ and the factor 7 appeared in 
the first term. To understand why this can be done, notice that, by assumption, 'Kout was constructed 
by running the 7- approximation on C = AfiS. So, for any indicator matrix X: 

||(i„ - Xout^out)^^s\\l < 7||(i„ - xx^)Ans\\l. 

Setting X = Xi„ shows the claim. Finally, we bound the term ||Y|||. Recall that 

Y = Afc - AQS(V^J^S)+V^ 

= Afc - AfcJ7S(V^J7S)+V^ - (A - Afc)J^S(V^f)S)+V^. 



Since A^ = UfcE^V^, we have that 

Akns(vlns)+vl = UfcSfcV^j^s(v^f]S)+vJ 

where the last equality follows because rank(Vji7S) = k, and so V^r2S(Vji7S)+ = 1^. Thus, the 
first two terms in the expression for E cancel and we have 

||Y||| = ||(A-A,)f^S(V^QS)+V^||2 

< ||(A-Afc)J^S|||||(vTfiS)+Vfc||i 

< ||(A-A,)J^S|||||(V^QS)+||i 
||(A-Afc)17S||| 



In the first 3 steps, we have used spectral submultiplicativity, and in the last step we have used the 
definition of the spectral norm of the pseudo-inverse. Combining all these bounds together (and 
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using 7 > 1): 



A - Xout^JutMlp < l|A - XoptXjpiAlll + 27 



||(In - X,„X^JAOS||^ + ||(A - Afc)OS||| 



The lemma now follows because 



A XopfXgpjAllp < ||A Xj„Xj„A 



which holds by the optimality of the indicator matrix Xopt on the high dimensional points containing 
in the rows of A. 

5.2 Proof of Theorem 2 

Theorem 2 will follow by using Lemmas 10 and 7. We would like to apply Lemma 10 for the matrices 
and S constructed with DeterministicSamplingI; to do that, we need 



rank(V^J^S) = k. 



This rank requirement follows from Lemma 7, because 





> 0. 



Hence, Lemma 10 gives 



A - XouAtMll < l|A - X„XT A||| + 27 



||(A - X,,,XlA)nS\\l + ||(A - AV,V^)OS||^ 



We can now use the bound for cr|(Vjr2S) from Lemma 7: 




Also, we can use the Frobenius norm bound for B of Algorithm 4: 



BQSIll < ||B 



From our choice of B 



BfiSlll = ||(A - AVfeVT)OS||| + ||(A - Xi^xT A)0S||2 , 



and 



B||2 = ||A - AVfeVjill + ||A - X,„XT A 
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so, 



||(A - AYkVlmil + ||(A - XinXiA)nS\\l < ||A - AVfcVj||| + ||A - X^^X^ A||2 . 
Combine all these bounds together and use 

II A - AVfcV^II^ < Topt < F{V,Sin), 

to wrap up. 

5.3 Proof of Theorem 4 

Theorem 4 follows by combining Lemmas 10 and 8. The rank requirement in Lemma 10 is satisfied 
by the bound for the smallest singular value of VjOS in Lemma 8. In Lemma 10, let 

So, 



I A - Xout^outMlp < l|A - XoptX.optM\F + 27 



||(A - XoptXl,A)nS\\l + ||(A - AVfcVT)f]S||| 



Now, from spectral submultiplicativity and using 

||A - AVfcVT||2 < ||A - XoptXl^AWl, 

we obtain, 

\\{A-XoptXl,A)nS\\l < ||(A-Xop<xT^A)||2||I„f]S||i, 

and 



||(A-AVfcV,!')f)S||^ < ||(A-AVfcVj)|||||I„J7S"2 



2 



< ||(A-X,piXT^A)||2||L„OS||2. 
The result now follows by using the following bounds from Lemma 8, 

||I„J7S||2 < 1 + V^- 

5.4 Proof of Theorem 5 

To prove the theorem, we are going to need a more general version of our crucial structural result. 
Lemma 10. 

Lemma 11. Fix A G M'"^^", A; > 0, and Xj„ G M.'^'"' . Let Q G M"^'' and S G W'' be any matrices, 
so C = AnS G M™^*". Let Z G M"""^' be any orthonormal matrix, such that A = AZT7 + E (note 
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that E = A — AZZ^ G Let 'Kout be the output of some approximation algorithm on C, k. 

Then, i/rank(Z"'^OS) = k, 

HA X xTaI|2<||A ^ ^T,||2 , , J|(A-X,„X^^A)OS||| + ||E^]S||^ 

cr^(ZM2S) 

Proof. The proof follows exactly the same argument as the proof of Lemma 10, after replacing 
Vfc = Z, Afc = AZZ^, and Ap_k = E = A - AZZ^. ■ 

Notice that Lemma 10 is a special case of the above lemma by assuming Z = V^, the matrix of 
the top k right singular vectors of A, in which case E = A — A^. 

To prove Theorem 5, we start with the general bound of Lemma 11; to apply the lemma, we 
need to satisfy the rank assumption, which will become clear shortly, during the course of the proof. 

The algorithm of Theorem 5 constructs the matrix Z by using the algorithm of Lemma 6 with e 
set to a constant, e = ^- Using the same notation as in Lemma 6, 

E = A - AZZ'^, 

and 

E[||E|||] <^||A-A,||2. 
We can now apply the Markov's inequality, to obtain that with probability at least 0.9, 

\\Eg < 15||A- Afelll. 

The randomized construction in the third step of Algorithm 3 gives sampling and rescaling 
matrices and Si; the deterministic construction in the fifth step of Algorithm 3 gives sampling 
and rescaling matrices Q and S. To apply Lemma 11, we will choose 

since the Lemma gives us the luxury to pick any indicator matrix Xj„. Note that we do not need to 
actually compute Xopt in the algorithm; we are just using it in the proof to get the desired result. 

Algorithm 3 first selects c columns using QiSi G M"^^. Let Y = V^JliSi G M*^^^, and consider 
its SVD, 

Y = USV^, 

with U G R'''''', S G R'''"', and V G M^'"'. Lemma 9 now implies that with probability 0.9, 

afc(Y) > 1 - y^Akln{20k)/c = ^, 

~ T ~ T 

because c = 16fcln(20A;), which means that rank(V ) = k. Since V is an input to Deterministic 
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Sampling II, it follows from Lemma 8 that 



afc(V nS) > 1 -^/k/^>0. 



Since, 

we can apply Lemma 11: 

||A - Xout^outMlF < l|A - X,„X4A||^ + 27 



rank(Yf7S) = rank(V nS) = k 



T A ii2 ^ II A V vT A ii2 , ^_ ll(A ~ ^optXjp(A)r2iSir2S||^ + llEOiSiOSII^ 



ct2(zTOiSiOS) 



To bound the denominator, observe that 



Now, since U is a full rotation. 



Thus, 



a|(Z^OiSiOS) > -(1 - y^k/ry. 



We now bound ||(A - XoptXjpiA)OiSiJ7S||p: 

\\{A-Xopt^ltA)niSinS\\l = ||(A-XoptXjpiA)17iSiI,OS||2 

< ||(A - Xoptx:^ptA)niSi\\l\\icns\\l 

< (l + v^)2||(A-X„piXT,A)0iSi||^, 

where the last inequality is because Ic is the input to DetrministicSampling II whose spectral norm is 
controlled, with £2 = c. Similarly, we bound ||Er2iSir2S||p as 

||EQiSiQS||p < (1 + \/c7^)2||EOiSi||p. 



From Lemma 9, 



and 



E[||EJ)iSi|||] = ||E|||, 



E 



||(A-XoptXT,A)17iSi 



I A XoptXgp^A\\p. 



By a simple application of Markov's inequality and the union bound, both equations below hold 
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with probability at least 0.6, 



||Er2iSi||p < 5||E||p; 
||(A-X„piXT,A)f]iSi||^ < 5\\A-XoptX^l,A\\l. 

We further manipulate the bound for the matrix E as 

llEOiSilll < 5||E||| < 75||Ap_fc|||. 

This now introduces another failure probability 0.1 (from the discussion in the beginning of this 
section). Putting all these bounds together, and using that 



opt 



I A XopjX^^ A||^, 



and 



we conclude the proof as follows. 



lA 



p-k\\F 



opt: 



27(||(A-XoptXT A)f^iSiJ]S 



cj2(zTf]iSif)S) 



|EOiSiOS||p 



< 



opt 



id 



k/r)^ 
(1 + 



80(1 + y/c/^f 



Q^O-fJ^opt 



(1 - v^)^ 

jTopt (640 + 0(7^ + c/r) 
• 0{c/r), 



where the last expression follows because for k < r < c, the dominant term is 0{c/r). Since 
c = 0(A;logA;), the result follows. Note, we have made no attempt to optimize constants. 

Finally, the failure probability of the theorem follows by a union bound on the events of the 
bounds of the equations 



|Er2iSi||p ^ 5||E||p; 



\\{A-XoptKpt^)^i^i\ 



< 5||A — XoptXgpjA 



and the lower bound of Lemma 9. 



6 Related work 

Feature selection has received considerable attention in the machine learning and pattern recognition 
communities. A large number of different techniques appeared in prior work, addressing the feature 
selection within the context of both clustering and classification. Surveys include [15], as well as [16], 
which reports the results of the NIPS 2003 challenge in feature selection. Popular feature selection 
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techniques include the Laplacian scores [19], the Fisher scores [13], or the constraint scores [36]. 
None of these feature selection algorithms have theoretical guarantees on the performance of the 
clusters obtained using the dimension-reduced features. 

We focus on the family of feature selection methods that resemble our feature selection techniques, 
in that they select features by looking at the right singular vectors of the matrix containing the 
data (the matrix A). Given the input m x n object-feature matrix A, and a positive integer k, 
a line of research tries to construct features for (unsupervised) data reconstruction, specifically for 
Principal Components Analysis (PCA). PCA corresponds to the task of identifying a subset of k 
linear combinations of columns from A that best reconstruct A. Subset selection for PCA asks to 
find the columns of A that reconstruct A with comparable error as do its top Principal Components. 
Jolliffe [20] surveys various methods for the above task. Four of them (called Bl, B2, B3, and 
B4 in [20]) employ the Singular Value Decomposition of A in order to identify columns that are 
somehow correlated with its top k left singular vectors. In particular, B3 employs a deterministic 
algorithm which is very similar to Algorithm 3 that we used in this work; no theoretical results 
are reported. An experimental evaluation of the methods of [20] on real datasets appeared in [21]. 
Another approach employing the matrix of the top k right singular vectors of A and a Procrustes- 
type criterion appeared in [22]. From an applications perspective, [34] employed the methods of [20] 
and [22] for gene selection in microarray data analysis. 

Feature selection for clustering seeks to identify those features that have the most discriminative 
power among all the features. [25] describes a method where one first computes the matrix G 
M"^'^', and then clusters the rows of by running, for example, the A;- means algorithm. One finally 
selects those k rows of that are closest to the centroids of the clusters computed by the previous 
step. The method returns those columns from A that correspond to the selected rows from V^. A 
different approach is described in [8]. This method selects features one at a time; it first selects the 
column of A which is most correlated with the top left singular vector of A, then projects A to this 
singular vector, removes the projection from A, computes the top left singular vector of the resulting 
matrix, and selects the column of A which is most correlated with the latter singular vector, etc. 
Greedy approaches similar to the method of [8] are described in [27] and [28]. There are no known 
theoretical guarantees for any of these methods. While these methods are superficially similar to 
our method, in that they use the right singular matrix and are based on some sort of greedy 
algorithm, the techniques we developed to obtain theoretical guarantees are entirely different and 
based on linear-algebraic sparsification results [3, 4]. 

The result most closely related to ours is the work in [5, 7]. This work provides a randomized 
algorithm which offers a theoretical guarantee. Specifically, for r = il(A;e~^ log A;), it is possible 
to select r features such that the optimal clustering in the reduced-dimension space is a (3 -|- e)- 
approximation to the optimal clustering. Our result improves upon this in two ways. First, our 
algorithms are deterministic; second, by using our deterministic algorithms in combination with 
this randomized algorithm, we can select r = 0{k) features and obtain a competitive theoretical 
guarantee. 

Finally, we should mention that if one allows linear combinations of the features (feature extrac- 
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tion), there are algorithms that offer theoretical guarantees. First there is the SVD itself, which con- 
structs k (mixed) features for which the optimal clustering in this feature space is a 2-approximation 
to the optimal clustering [11]. It is possible to improve the efficiency of this SVD algorithm con- 
siderably by using the approximate SVD (as in Lemma 6) instead of the exact SVD to get nearly 
the same approximation guarantee with k features. The exact statement of this improvement can 
be found in [7]. Boutsidis et al. [6] shows how to select 0(/ce~^) (mixed) features with random 
projections and also obtaining a (2 -|- e)-guarantee. While these algorithms are interesting, they do 
not produce features that preserve the integrity of the original features. The focus of this work is 
on what one can achieve while preserving the original features. 
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